White Wine Quality Data Analysis Project by Yang Liu

Abstract

In this project, I explore and analyze the white wine quality dataset. The dataset contains 4898 white wines, each described by 11 variables that quantify different chemical attributes. It also includes an output variable, the quality rating of each wine on a scale from 0 to 10. I analyze the relationships between the wine attributes and the ratings, and I explore whether any of the attributes are strongly related to one another.

Dataset

In this section, I load the data. The variable names are shown below.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Now let’s see the structure of the variables:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
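
As a minimal sketch of the loading step, the code below writes a tiny CSV to a temporary file and reads it back with read.csv(). The real file name is not given in this report, so the temporary file is only a stand-in, and the three rows (with a reduced set of columns) are copied from the first values shown in the str() output above.

```r
# Toy CSV built from the first three rows visible in the str() output.
# Assumption: the real data would be read from a CSV file whose name is
# not given in this report, so a temporary file stands in for it here.
csv_lines <- c(
  "X,fixed.acidity,volatile.acidity,pH,alcohol,quality",
  "1,7,0.27,3,8.8,6",
  "2,6.3,0.3,3.3,9.5,6",
  "3,8.1,0.28,3.26,10.1,6"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

wine <- read.csv(path)

names(wine)  # column names
str(wine)    # one line per column: type and first values
```

On the full file, the same two calls would produce listings like the ones shown above.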

There is an X variable, which is just the index of each wine, so it is left out of the summary. Since there is no missing data in this dataset, I simply show the summary of each remaining variable below.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
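
The two steps described before the summary (checking for missing data and leaving out the index column) could be sketched as follows. The small data frame is an illustration built from the first five rows shown in the str() output; the names na_counts and wine_no_index are my own.

```r
# Toy data frame: first five rows of a few columns from the str() output.
wine <- data.frame(
  X       = 1:5,
  pH      = c(3, 3.3, 3.26, 3.19, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Count missing values per column; all zeros means no missing data.
na_counts <- colSums(is.na(wine))
print(na_counts)

# Drop the index column X before summarising.
wine_no_index <- subset(wine, select = -X)
summary(wine_no_index)
```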

Univariate Plots Section

In this section, I plot several histograms to explore how the wines are distributed across different variables.

First let’s take a look at the ratings of the wines.

The ratings are roughly bell-shaped and centered at 6 (median 6, mean 5.878), and most wines received a rating of 5 or 6. Because the ratings are whole numbers, the distribution is only approximately normal.

Next, let’s take a look at alcohol. The wine counts decrease as the alcohol percentage increases. Wines with about 9% alcohol are the most common, and the distribution is right skewed, with a tail toward higher alcohol content.

Let’s take a look at the fixed acidity. Most wines have a fixed acidity between 6 and 8 g/dm^3.

The above histogram shows the counts for total sulfur dioxide. Most wines have total sulfur dioxide between 100 and 200 mg/dm^3.

This histogram shows the counts for wines with different pH. Most wines have a pH between about 3.0 and 3.3.

This histogram shows the counts for residual sugar. The histogram peaks below 2.5 g/dm^3, but the distribution has a long right tail: the median is 5.2 g/dm^3 and the maximum is 65.8 g/dm^3.
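
A statement about how many wines fall under a threshold can be checked directly rather than read off a histogram. The sketch below does this for the ten residual sugar values visible in the str() output; the 2.5 g/dm^3 cutoff comes from the text above, and share_low is a name of my own. The result describes these ten rows only, not the full dataset.

```r
# First ten residual sugar values from the str() output (g/dm^3).
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Proportion of wines below the 2.5 g/dm^3 cutoff: the mean of a logical
# vector is the share of TRUE values.
share_low <- mean(residual_sugar < 2.5)
print(share_low)

# Quartiles give a second view of where the bulk of the values sit.
print(quantile(residual_sugar, probs = c(0.25, 0.5, 0.75)))
```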

Last, let’s plot the histograms for every variable in the data in a single figure.
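
One way to draw several histograms in one figure is base R's par(mfrow = ...) with a loop over the columns. This is only a sketch under assumptions: the plotting code used for this report is not shown, the data frame holds just the ten rows visible in the str() output, and the figure is written to a temporary PDF so the example runs without a display.

```r
# Toy data frame: first ten rows of four columns from the str() output.
wine <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)                      # open a file-based graphics device
old_par <- par(mfrow = c(2, 2))    # 2 x 2 grid of panels

# One histogram per column, titled with the column name.
for (v in names(wine)) {
  hist(wine[[v]], main = v, xlab = v, col = "grey80")
}

par(old_par)                       # restore the previous layout
dev.off()                          # close the device and write the file
```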

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations and 13 variables in this dataset. Among the variables, X is the index of each wine and quality is its rating; both are stored as integers (int). Quality is the output variable, and the other 11 variables are numeric (num) properties of the wines.

What is/are the main feature(s) of interest in your dataset?

In this dataset, I’m interested in the relations between pH, alcohol and quality. I would like to explore if there is any strong relationship between them.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Density, volatile acidity and free sulfur dioxide may also support my investigation.

Did you create any new variables from existing variables in the dataset?

I haven’t created any new variables so far, since I’m not familiar with all of the chemical properties. For many of them, it is unclear what counts as a high or low value.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Some variables are right skewed (for example residual sugar, whose mean of 6.391 is above its median of 5.2) and some are approximately normally distributed. I did not notice any unusual distributions in the dataset.
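
A rough way to judge the direction of skew is to compare each variable's mean with its median: a mean above the median hints at a longer right tail. The sketch below applies this heuristic to the means and medians copied from the summary table above. It is a hint rather than a formal skewness test, and the column names mean_minus_median and hint are my own.

```r
# Means and medians copied from the summary() output above.
summary_stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  median = c(6.8, 0.26, 0.32, 5.2, 0.043, 34, 134, 0.9937, 3.18, 0.47,
             10.40, 6),
  mean   = c(6.855, 0.2782, 0.3342, 6.391, 0.04577, 35.31, 138.4, 0.9940,
             3.188, 0.4898, 10.51, 5.878)
)

# Positive difference: mean above median, suggesting a right tail.
summary_stats$mean_minus_median <- summary_stats$mean - summary_stats$median
summary_stats$hint <- ifelse(summary_stats$mean_minus_median > 0,
                             "right tail", "left tail or symmetric")
print(summary_stats)
```

In this table, every variable except quality has a mean above its median.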

Bivariate Plots Section

In this part, let’s take a look at some bivariate plots to explore the variables I am interested in.

The above graph is the scatter plot of pH vs. alcohol. It does not show any strong relationship between pH and alcohol.

The above graph is the scatter plot of residual.sugar vs. pH. It does not show any strong relationship between residual.sugar and pH.

Above is the scatter plot of volatile.acidity vs. pH. My assumption was that volatile acidity would affect pH, but the scatter plot does not show a strong relationship between the two.

The above plot is total.sulfur.dioxide vs. density. Density tends to increase with total sulfur dioxide.

Let’s take a look at alcohol vs. density. As alcohol increases, the density of the wine drops.

The plot of pH vs. density shows no strong relationship between pH and density.
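
Visual impressions from scatter plots can be backed up with correlation coefficients. The sketch below computes Pearson correlations with cor() for four of the variables plotted above, using only the ten rows visible in the str() output, so the numbers it prints say nothing reliable about the full dataset. Density is left out because only five of its values appear in that output.

```r
# Toy data frame: first ten rows of four columns from the str() output.
wine <- data.frame(
  pH               = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22)
)

# Pearson correlation for a single pair of variables.
r_ph_alcohol <- cor(wine$pH, wine$alcohol)
print(r_ph_alcohol)

# Correlation matrix for all pairs at once, rounded for readability.
cor_mat <- cor(wine)
print(round(cor_mat, 2))
```

Values near 0 would match a plot with no visible trend, while values near -1 or 1 would indicate a strong linear relationship.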

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection